ML with Tidymodels

Machine-Learning
R
Basics
Author

F.L

Published

April 21, 2024

Introduction

This notebook summarizes key points from Max Kuhn and Julia Silge’s Tidy Modeling with R. It only covers basic usage of tidymodels plus some dimension reduction techniques.

Link to the Book: https://www.tmwr.org/

Introduction to the Data

Code
library(tmap)
library(osmdata)
library(tidymodels)
data("ames")

## refer to tmap: https://r-tmap.github.io/tmap-book/visual-variables.html
## for osm data: https://cran.r-project.org/web/packages/osmdata/vignettes/osmdata.html
## for query streets: https://wiki.openstreetmap.org/wiki/Key%3ahighway

ames_sf = sf::st_as_sf(ames,coords = c("Longitude","Latitude"), crs=4326)
ames_bbox = sf::st_bbox(ames_sf)
osm_streets = opq(bbox = ames_bbox) |> 
  add_osm_feature(key="highway",value = c(
                                          'secondary'
                                          ,'primary'
                                          ,'tertiary'
                                          ,'unclassified'
                                          ,'residential')) |> 
  # add_osm_feature(key="highway",value = 'motorway') |> 
  osmdata_sf()

## intersect the streets with the bounding box
streets_sf = sf::st_intersection(sf::st_as_sfc(ames_bbox), osm_streets$osm_lines)

tm_shape(streets_sf) + 
  tm_lines(col='grey') +
tm_shape(ames_sf) + 
  tm_dots( shape = "Lot_Shape"
          ,col="Neighborhood"
          ,style = "cont"
          ,size=0.05
          ,border.col=NA
          ,border.lwd=0.01) + 
  tm_layout(legend.show=FALSE)
Warning: tm_scale_continuous is supposed to be applied to numerical data

Being familiar with this data will come in handy when comparing different model outputs later.

Code
library(tidymodels)
# 
ggplot2::theme_set(theme_minimal())
tidymodels_prefer()

ggplot(ames, aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, col= "white")

The first thing the book points out is that Sale_Price is not normally distributed, so it needs to be normalised somehow (here, a log10 transformation).

Code
ggplot(ames, aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, col= "white") +
  scale_x_log10()

ames <- ames |> mutate(Sale_Price = log10(Sale_Price))
# ames |> 
#   head(1) |> 
#   glimpse()

Spoiler Alert

The following code creates a linear model. The prediction uses these variables:

## preview columns in ames data
ames |> 
  select(Neighborhood, Gr_Liv_Area, Year_Built, Bldg_Type, Latitude, Longitude) |> 
  slice_sample(n=1) |> 
  glimpse()
Rows: 1
Columns: 6
$ Neighborhood <fct> Mitchell
$ Gr_Liv_Area  <int> 974
$ Year_Built   <int> 1991
$ Bldg_Type    <fct> OneFam
$ Latitude     <dbl> 41.98651
$ Longitude    <dbl> -93.60664

These are their transformations:

  • Neighborhood: collapse low-frequency levels into “other”, then make dummy variables
  • Gr_Liv_Area: log10 treatment
  • Year_Built: kept as the year
  • Bldg_Type: convert building type into dummy variables
  • Latitude: spline function treatment
  • Longitude: spline function treatment
library(tidymodels)
data(ames)

## Normalise Prediction
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

## Split Data Sets
set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

## Recipe for preprocessing data; build the recipe object
ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)
  
## Linear Model
lm_model <- linear_reg() %>% set_engine("lm")

## Finally, assemble the lazy workflow object
lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

## Fit a Model
lm_fit <- fit(lm_wflow, ames_train)

The basics

Splitting/Feature Selection/Create a “Data Budget”

library(tidymodels)

Simple 80-20 split

The basics are the same: split into train and test. Here the data is split 80-20.

ames_split <- rsample::initial_split(ames, prop = 0.80)
ames_split
<Training/Testing/Total>
<2344/586/2930>

Regarding the split proportion, here is the advice from the book:

A test set should be avoided only when the data are pathologically small.

ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)
dim(ames_train)
[1] 2344   74

Validation Split 60-20-20

set.seed(52)
# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- rsample::initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split
<Training/Validation/Testing/Total>
<1758/586/586/2930>
ames_train <- training(ames_val_split)
ames_test <- testing(ames_val_split)
ames_val <- validation(ames_val_split)

Concepts

  • independent experimental unit: (in database terms, this is just a matter of an object UID versus an alternate UID) for example, one patient being measured
  • multi-level data / multiple rows per experimental unit:

Data splitting should occur at the independent experimental unit level of the data!!!

Simple resampling across rows would lead to some data within an experimental unit being in the training set and others in the test set.
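To avoid that, the split can be done by group rather than by row. A minimal sketch of my own, using rsample’s group_initial_split() and Neighborhood as a stand-in grouping variable (the book does not split ames this way):

## every row of a given neighborhood ends up on the same side of the split
set.seed(123)
ames_group_split <- group_initial_split(ames, group = Neighborhood, prop = 0.80)
ames_group_train <- training(ames_group_split)
ames_group_test  <- testing(ames_group_split)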

Practical Implication

  • the book acknowledges the practice of fitting on a train/test split first to validate the model, but then following up with a fit on all available data points for a better final estimate (a minimal sketch of this pattern follows).
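A sketch of that pattern, reusing the lm_wflow defined in the spoiler section above (my own example, not code from the book):

## 1. validate on the 80-20 split first
lm_fit_train <- fit(lm_wflow, ames_train)

## 2. once happy with it, refit the same workflow on every available row
lm_fit_all <- fit(lm_wflow, ames)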

Fitting Model with Parsnip

  • linear_reg
  • rand_forest

Linear Regression Family

  • lm
  • glmnet: fits generalised linear models via penalized maximum likelihood.
  • stan
# switch computational backend for different model
linear_reg() |> 
  set_engine("lm") |> 
  translate()
Linear Regression Model Specification (regression)

Computational engine: lm 

Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
#  regularized regression is the glmnet model 
linear_reg(penalty=1) |> 
  set_engine("glmnet") |> 
  translate()
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    family = "gaussian")
# To estimate with regularization, the second case, a Bayesian model can be fit using the rstanarm package:
linear_reg() |> 
  set_engine("stan") |> 
  translate()
Linear Regression Model Specification (regression)

Computational engine: stan 

Model fit template:
rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), 
    weights = missing_arg(), family = stats::gaussian, refresh = 0)
lm_model = linear_reg() |> 
  set_engine("lm") |> 
  translate()

lm_model |> 
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)
parsnip model object


Call:
stats::lm(formula = Sale_Price ~ Longitude + Latitude, data = data)

Coefficients:
(Intercept)    Longitude     Latitude  
   -313.623       -2.074        2.965  
lm_xy_fit <- 
  lm_model %>% 
  fit_xy(
    x = ames_train %>% select(Longitude, Latitude),
    y = ames_train %>% pull(Sale_Price)
  )

lm_xy_fit
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)    Longitude     Latitude  
   -313.623       -2.074        2.965  

Tree Model

rand_forest(trees = 1000, min_n = 5) %>% 
  set_engine("ranger") %>% 
  set_mode("regression") %>% 
  translate()
Random Forest Model Specification (regression)

Main Arguments:
  trees = 1000
  min_n = 5

Computational engine: ranger 

Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    num.trees = 1000, min.node.size = min_rows(~5, x), num.threads = 1, 
    verbose = FALSE, seed = sample.int(10^5, 1))

Capture Model Results

Raw original way (useful for checking the original documentation)

lm_form_fit <- 
  lm_model %>% 
  # Recall that Sale_Price has been pre-logged
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)

lm_form_fit %>% extract_fit_engine() %>% vcov()
            (Intercept)    Longitude     Latitude
(Intercept)  273.852441  2.052444651 -1.942540743
Longitude      2.052445  0.021122353 -0.001771692
Latitude      -1.942541 -0.001771692  0.042265807
model_res <- 
  lm_form_fit %>% 
  extract_fit_engine() %>% 
  summary()

# The model coefficient table is accessible via the `coef` method.
param_est <- coef(model_res)
class(param_est)
[1] "matrix" "array" 
param_est
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -313.622655 16.5484876 -18.95174 5.089063e-73
Longitude     -2.073783  0.1453353 -14.26896 8.697331e-44
Latitude       2.965370  0.2055865  14.42395 1.177304e-44

The Tidy ecosystem for model result

What’s good about tidy() is that the results come back as a tibble, so the model output is easy to reuse downstream.

tidy(lm_form_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -314.      16.5       -19.0 5.09e-73
2 Longitude      -2.07     0.145     -14.3 8.70e-44
3 Latitude        2.97     0.206      14.4 1.18e-44

Model Workflow

Chapter Link: workflows

In Python or Spark, the similar concept is called a “pipeline”.

  1. Initiate a workflow with workflow()
  2. Add whatever model you need
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

lm_model <- linear_reg() %>% set_engine("lm")
## set up parsnip linear model
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

## add this model to workflow (pipline)
lm_wflow <- 
  workflow() %>% 
  add_model(lm_model)

lm_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: None
Model: linear_reg()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

An R formula can now be used as a “pre-processor”:

lm_wflow <- 
  lm_wflow %>% 
  add_formula(Sale_Price ~ Longitude + Latitude)

lm_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
Sale_Price ~ Longitude + Latitude

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Update Formula

It is possible to update the formula to this:

lm_fit %>% update_formula(Sale_Price ~ Longitude)
lm_wflow <- 
  lm_wflow %>% 
  remove_formula() %>% 
  add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))
lm_wflow

The Role of a Formula (a base R illustration follows the list):

  • inline transformations (e.g., log(x));
  • creating dummy variable columns;
  • creating interactions or other column expansions
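All three roles can show up in a single base R formula; a quick illustration of my own (not from the book):

## inline log transform, automatic dummy variables for Bldg_Type, and an interaction
lm(Sale_Price ~ log10(Gr_Liv_Area) + Bldg_Type + log10(Gr_Liv_Area):Bldg_Type,
   data = ames_train)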

Formulas are Package-Dependent:

You have to go through the models one by one to see what type of pre-processing each one requires.

  • Most packages for tree-based models use the formula interface but do not encode the categorical predictors as dummy variables.
  • Packages can use special inline functions that tell the model function how to treat the predictor in the analysis. For example, in survival analysis models, a formula term such as strata(site) would indicate that the column site is a stratification variable. This means it should not be treated as a regular predictor and does not have a corresponding location parameter estimate in the model.
  • A few R packages have extended the formula in ways that base R functions cannot parse or execute. In multilevel models (e.g., mixed models or hierarchical Bayesian models), a model term such as (week | subject) indicates that the column week is a random effect that has different slope parameter estimates for each value of the subject column.

A workflow is a general purpose interface. When add_formula() is used, how should the workflow preprocess the data? Since the pre-processing is model dependent, workflows attempts to emulate what the underlying model would do whenever possible. If it is not possible, the formula processing should not do anything to the columns used in the formula. Let’s look at this in more detail.
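For instance, with a tree-based engine such as ranger, the workflow’s formula preprocessing should leave Bldg_Type as a factor instead of expanding it into dummy variables. A minimal sketch of that case (my own example, relying on the behaviour described above):

rf_spec <- rand_forest(mode = "regression") %>% set_engine("ranger")

rf_wflow <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area + Bldg_Type) %>% 
  add_model(rf_spec)

rf_fit <- fit(rf_wflow, ames_train)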

Special Formula/In-line Function

Standard R methods cannot properly process this formula: lme4::lmer() handles it, but model.matrix() warns and produces a meaningless design matrix (see below).

library(lme4)
library(nlme)
data("Orthodont")
lmer(distance ~ Sex + (age | Subject), data = Orthodont)
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ Sex + (age | Subject)
   Data: Orthodont
REML criterion at convergence: 471.1635
Random effects:
 Groups   Name        Std.Dev. Corr 
 Subject  (Intercept) 7.3915        
          age         0.6943   -0.97
 Residual             1.3101        
Number of obs: 108, groups:  Subject, 27
Fixed Effects:
(Intercept)    SexFemale  
     24.517       -2.146  
model.matrix(distance ~ Sex + (age | Subject), data = Orthodont)
Warning in Ops.ordered(age, Subject): '|' is not meaningful for ordered factors
     (Intercept) SexFemale age | SubjectTRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Sex
[1] "contr.treatment"

attr(,"contrasts")$`age | Subject`
[1] "contr.treatment"

However, using add_variables() and passing the model-specific formula to add_model() solves this problem:

library(multilevelmod)

multilevel_spec <- linear_reg() %>% set_engine("lmer")

multilevel_workflow <- 
  workflow() %>% 
  # Pass the data along as-is: 
  add_variables(outcome = distance, predictors = c(Sex, age, Subject)) %>% 
  add_model(multilevel_spec, 
            # This formula is given to the model
            formula = distance ~ Sex + (age | Subject))

multilevel_fit <- fit(multilevel_workflow, data = Orthodont)
multilevel_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Variables
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
Outcomes: distance
Predictors: c(Sex, age, Subject)

── Model ───────────────────────────────────────────────────────────────────────
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ Sex + (age | Subject)
   Data: data
REML criterion at convergence: 471.1635
Random effects:
 Groups   Name        Std.Dev. Corr 
 Subject  (Intercept) 7.3915        
          age         0.6943   -0.97
 Residual             1.3101        
Number of obs: 108, groups:  Subject, 27
Fixed Effects:
(Intercept)    SexFemale  
     24.517       -2.146  

Use Multiple Models at Once

location <- list(
  longitude = Sale_Price ~ Longitude,
  latitude = Sale_Price ~ Latitude,
  coords = Sale_Price ~ Longitude + Latitude,
  neighborhood = Sale_Price ~ Neighborhood)
  
  
library(workflowsets)

location_models <- workflow_set(preproc = location, models = list(lm = lm_model))
location_models
# A workflow set/tibble: 4 × 4
  wflow_id        info             option    result    
  <chr>           <list>           <list>    <list>    
1 longitude_lm    <tibble [1 × 4]> <opts[0]> <list [0]>
2 latitude_lm     <tibble [1 × 4]> <opts[0]> <list [0]>
3 coords_lm       <tibble [1 × 4]> <opts[0]> <list [0]>
4 neighborhood_lm <tibble [1 × 4]> <opts[0]> <list [0]>

If you ever want to fit these models, you have to use `purrr::map()`, which is actually intuitive for R users.

Right now the option and result columns are all empty.

location_models <-
   location_models %>%
   mutate(fit = map(info, ~ fit(.x$workflow[[1]], ames_train)))
location_models
# A workflow set/tibble: 4 × 5
  wflow_id        info             option    result     fit       
  <chr>           <list>           <list>    <list>     <list>    
1 longitude_lm    <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
2 latitude_lm     <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
3 coords_lm       <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
4 neighborhood_lm <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>

Evaluate the Test Set with the `last_fit()` Method

final_lm_res <- last_fit(lm_wflow, ames_split)
final_lm_res
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits             id               .metrics .notes   .predictions .workflow 
  <list>             <chr>            <list>   <list>   <list>       <list>    
1 <split [2342/588]> train/test split <tibble> <tibble> <tibble>     <workflow>

…the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes pre-processing steps and operations taken after a model is fit. We introduced a concept called a model workflow that can capture the important components of the modeling process. Multiple workflows can also be created inside of a workflow set. The last_fit() function is convenient for fitting a final model to the training set and evaluating with the test set.

For the Ames data, the related code that we’ll see used again is:

library(tidymodels)
data(ames)

## normalise y
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

## split data
set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

## linear models
lm_model <- linear_reg() %>% set_engine("lm")

## validating result
lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))

lm_fit <- fit(lm_wflow, ames_train)

Feature Engineering with Recipes

Syntax to use with recipe

USAGE:

  • Start with a recipe() function call
  • then chain on a series of step_*() functions
## create a recipe object
simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal_predictors())

## add the recipe
lm_wflow %>% 
  add_recipe(simple_ames)

Compare a Recipe with a Standard Linear Model Formula

When this function is executed, the data are converted from a data frame to a numeric design matrix (also called a model matrix) and then the least squares method is used to estimate parameters.

A Standard Linear Model:

lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)

Using a recipe:

library(tidymodels) # Includes the recipes package
tidymodels_prefer()

simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal_predictors())
simple_ames
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 4
── Operations 
• Log transformation on: Gr_Liv_Area
• Dummy variables from: all_nominal_predictors()

A recipe is more verbose but more flexible than a formula:

Okay, why not just use a formula?

  • These computations can be recycled across models since they are not tightly coupled to the modeling function.
  • A recipe enables a broader set of data processing choices than formulas can offer.
  • The syntax can be very compact. For example, all_nominal_predictors() can be used to capture many variables for specific types of processing while a formula would require each to be explicitly listed.
  • All data processing can be captured in a single R object instead of in scripts that are repeated, or even spread across different files.

Note on removing the existing preprocessor before adding a recipe

lm_wflow %>% 
  add_recipe(simple_ames)
Error in `add_recipe()`:
! A recipe cannot be added when a formula already exists.

You have to remove the existing preprocessor before adding the recipe. (Note that this workflow currently carries a formula, not variables, so the remove_variables() call below warns and add_recipe() still errors; remove_formula() is what is actually needed here.)

lm_wflow <- 
  lm_wflow %>% 
  remove_variables() %>% 
  add_recipe(simple_ames)
Warning: The workflow has no variables preprocessor to remove.
Error in `add_recipe()`:
! A recipe cannot be added when a formula already exists.
lm_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
Sale_Price ~ Longitude + Latitude

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Typical Pre-Processing in Statistics

Note

This section includes a few typical treatments:

  • For nominal values, consider collapsing low-frequency levels with step_other.
  • The second is for interaction terms: when a combined effect differs from the additive effects, specify it with step_interact.
  • spline functions (non-linear relationships), typically used for coordinates: step_ns
    • I like to think of a spline function as a stretching sheet.
  • PCA feature extraction (use step_pca, typically after step_normalize)

Consider Encoding Nominal Values

d = ames_train |> 
  count(Neighborhood) |> 
  mutate(frequency = n / sum(n))


highest_n_at_0.01 = d |> 
  filter(frequency <= 0.01) |> 
  filter(n == max(n)) |> 
  pull(n)
  
d |> 
  ggplot(aes(y=Neighborhood,x=n)) + 
  geom_col() +
  gghighlight::gghighlight(n <= highest_n_at_0.01) + 
  ggtitle("These low frequency variables can be problematic")

  • Nominal values: consider lumping low-frequency categories into “other”; for this step you would use step_other;
  • step_dummy(all_nominal_predictors());
simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors())

Consider Interaction Terms: Variables Can Interact with One Another

Interactions are defined in terms of their effect on the outcome and can be combinations of different types of data (e.g., numeric, categorical, etc). Chapter 7 of M. Kuhn and Johnson (2020) discusses interactions and how to detect them in greater detail.

… two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone.

Think of an interaction as a group_by: the regression is recalculated within each group.

ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) + 
  geom_point(alpha = .2) + 
  facet_wrap(~ Bldg_Type) + 
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "lightblue") + 
  scale_x_log10() + 
  scale_y_log10() + 
  labs(x = "Gross Living Area", y = "Sale Price (USD)")

simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  # Gr_Liv_Area is on the log scale from a previous step
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") )

Spline Function (Non-Linear Relationship)

library(patchwork)
library(ggplot2)
library(splines)

plot_smoother <- function(deg_free) {
  ggplot(ames_train, aes(x = Latitude, y = 10^Sale_Price)) + 
    geom_point(alpha = .2) + 
    scale_y_log10() +
    geom_smooth(
      method = lm,
      formula = y ~ ns(x, df = deg_free),
      color = "lightblue",
      se = FALSE
    ) +
    labs(title = paste(deg_free, "Spline Terms"),
         y = "Sale Price (USD)") +
    theme_minimal()
}

# plot_smoother(2) 
( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )

The example use case here is for coordinates, where the relationship with the outcome is non-linear:

recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, deg_free = 20)
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 5
── Operations 
• Log transformation on: Gr_Liv_Area
• Collapsing factor levels for: Neighborhood
• Dummy variables from: all_nominal_predictors()
• Interactions with: Gr_Liv_Area:starts_with("Bldg_Type_")
• Natural splines on: Latitude

Feature Extraction (Dimension Reduction Techniques)

The typical one you will see is PCA, but more dimension reduction techniques exist, for example (a small recipe sketch follows this list):

  • ICA Independent Component Analysis
  • NNMF Non-Negative Matrix Factorization
  • Multidimensional Scaling (MDS)
  • Uniform Manifold Approximation and Projection (UMAP)
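A minimal recipe sketch of my own for the PCA case (the other methods have analogous steps, e.g. step_ica in recipes or step_umap in the embed package):

pca_rec <- 
  recipe(Sale_Price ~ ., data = ames_train) %>% 
  ## PCA needs the predictors on a common scale
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), num_comp = 5)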

Row Sampling Steps

These are techniques (e.g., down-sampling or up-sampling an imbalanced outcome) that are said to improve the distribution of the predictions rather than overall performance.
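For class imbalance, the usual steps live in the themis package. A hedged sketch with a hypothetical factor outcome called class and a hypothetical training frame train_df (ames has no such column):

library(themis)

## down-sample the majority class so both classes appear equally often
imbalance_rec <- 
  recipe(class ~ ., data = train_df) %>%   # class and train_df are hypothetical
  step_downsample(class)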

Natural Language Processing
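The book points to the textrecipes package for these steps. A minimal sketch, assuming a hypothetical data frame reviews_train with a text column review_text and a numeric outcome score:

library(textrecipes)

text_rec <- 
  recipe(score ~ review_text, data = reviews_train) %>%   # hypothetical data
  step_tokenize(review_text) %>% 
  step_stopwords(review_text) %>% 
  step_tfidf(review_text)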

Model Effectiveness Measurement

Chapter Extract

The typical statistical analysis workflow is analysing different performance metrics given the data. But in sum, you should think of measurement in these terms:

  • Accuracy measurement (rmse)
  • Effectiveness measurement (rsq)
  • Implication measurement (rsq)

Classical measures for a numeric (regression) outcome are these three:

  • rmse
  • rsq
  • mae

For binary classification there are:

  • conf_mat confusion matrix
  • accuracy
  • mcc Matthews correlation coefficient
  • f_meas F1 metric

When you use predicted class probabilities as input rather than hard class predictions, the go-to metric is the ROC curve (ding ding ding!)

  • roc_curve

There is a useful autoplot() method that lets you plot it.

Ultimately this gives you the power to compare different performance metrics easily.
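A minimal sketch of the classification metrics listed above, using the two_class_example data frame that ships with yardstick:

data("two_class_example", package = "yardstick")

## hard class predictions
conf_mat(two_class_example, truth = truth, estimate = predicted)
accuracy(two_class_example, truth = truth, estimate = predicted)
mcc(two_class_example, truth = truth, estimate = predicted)
f_meas(two_class_example, truth = truth, estimate = predicted)

## predicted class probabilities -> ROC curve
roc_curve(two_class_example, truth, Class1) |> 
  autoplot()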

Yardstick Basic Usage

Yardstick is a tool that produces performance metrics with a consistent interface.

## Create a prediction frame
ames_test_res <- predict(lm_fit, new_data = ames_test %>% select(-Sale_Price))
ames_test_res
#> # A tibble: 588 × 1
#>   .pred
#>   <dbl>
#> 1  5.07
#> 2  5.31


## Bind prediction with Actual value
ames_test_res <- bind_cols(ames_test_res, ames_test %>% select(Sale_Price))
ames_test_res
#> # A tibble: 588 × 2
#>   .pred Sale_Price
#>   <dbl>      <dbl>
#> 1  5.07       5.02
#> 2  5.31       5.39

## YARDSTICK!! given a dataframe just do these two
rmse(ames_test_res, truth = Sale_Price, estimate = .pred)


## compare multiple metrics
ames_metrics <- metric_set(rmse, rsq, mae)
ames_metrics(ames_test_res, truth = Sale_Price, estimate = .pred)